Data Analysis AirBnB - Porto & Lisbon - Listings Dataset

Performing exploratory data analysis for a maximum of three features, comparing the listings in Lisbon and Porto.

Development of a decision tree predicting the place availability for the next year (column “availability_365”). More specifically, predict whether the place will be available for at least 280 days out of the 365 days.

Load and Merge the datasets
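The original loading code is not shown, so the following is only a minimal sketch of how the two city files could be read and stacked with pandas. The helper name `load_and_merge`, the added `city` column and the in-memory sample rows are assumptions made for illustration; in practice the two arguments would be paths to the Lisbon and Porto listings files.

```python
import io

import pandas as pd


def load_and_merge(lisbon_src, porto_src):
    """Read both listings files and stack them, tagging each row with its city."""
    lisbon = pd.read_csv(lisbon_src).assign(city="Lisbon")
    porto = pd.read_csv(porto_src).assign(city="Porto")
    return pd.concat([lisbon, porto], ignore_index=True)


# Tiny in-memory stand-ins for the real CSV files (made-up rows, not real data).
lisbon_csv = io.StringIO(
    "id,room_type,availability_365\n"
    "1,Private room,300\n"
    "2,Entire home/apt,120\n"
)
porto_csv = io.StringIO(
    "id,room_type,availability_365\n"
    "3,Entire home/apt,365\n"
)

listings = load_and_merge(lisbon_csv, porto_csv)
print(listings)
```

Keeping a `city` column makes the Lisbon versus Porto comparison a simple `groupby` later on.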

Handling Missing Data

For simplicity I will drop all NaN rows.
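As a rough illustration of this step, dropping every row that contains at least one missing value could look like the sketch below. The sample rows are made up; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second and third rows each contain a missing value.
listings = pd.DataFrame({
    "last_review": ["2019-01-05", None, "2018-08-20"],
    "minimum_nights": [2, 3, np.nan],
    "availability_365": [300, 120, 365],
})

rows_before = len(listings)
# dropna() with default arguments removes any row with at least one NaN.
clean = listings.dropna().reset_index(drop=True)
rows_dropped = rows_before - len(clean)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Reporting how many rows were removed is a cheap way to check that the simplification does not discard most of the data.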

Target Variable
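A minimal sketch of how the binary target described in the introduction could be derived from `availability_365`, with 1 meaning at least 280 days available and 0 meaning fewer than 280. The column name `target` is an assumption.

```python
import pandas as pd

# Made-up availability values around the 280-day threshold.
listings = pd.DataFrame({"availability_365": [0, 279, 280, 365]})

# 1 = available for at least 280 days, 0 = smaller than 280 days.
listings["target"] = (listings["availability_365"] >= 280).astype(int)

# Class balance, useful to confirm the roughly 50/50 split on the real data.
print(listings["target"].value_counts(normalize=True))
```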

Exploring DateTime Features

The more recent the last review, the more likely the listing is to be available for at least 280 days. Conversely, listings with older reviews are more likely to fall into the Smaller than 280 days availability class.

There seems to be a relationship between certain months and the number of entries available for at least 280 days of the year, namely January, August, September and October.

There is also some apparent cyclicality throughout the year: January is the highest month, followed by a sharp decrease until April, a slow increase until July, a faster increase over the end-of-summer months (August, September and October), and then a decrease again in November and December.
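The month and year comparisons above can be sketched as follows, assuming the date parts are taken from `last_review` and that a binary `target` column already exists. The rows are made up and the column names `year` and `month` are assumptions.

```python
import pandas as pd

# Made-up rows; target = 1 means at least 280 days of availability.
listings = pd.DataFrame({
    "last_review": ["2019-01-05", "2019-01-20", "2018-04-11", "2018-08-02"],
    "target": [1, 1, 0, 1],
})

# Parse the date strings and extract the calendar parts.
listings["last_review"] = pd.to_datetime(listings["last_review"])
listings["year"] = listings["last_review"].dt.year
listings["month"] = listings["last_review"].dt.month

# Per month: number of listings, how many reach 280 days, and their share.
by_month = listings.groupby("month")["target"].agg(["count", "sum", "mean"])
print(by_month)
```

Looking at the share (`mean`) alongside the raw count helps separate a genuine monthly effect from months that simply have more reviews.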

By itself, room_type does not split the data into two groups, since the target is fairly balanced across the 4 categories. However, Entire home/apt is noticeably the most frequent category, followed by Private room, which actually seems to be more common among listings with Smaller than 280 days availability.

When combined with other features, this may hold some useful explanatory power.

A lower minimum number of nights seems to result in more entries with at least 280 days of availability; this relationship inverts as the minimum number of nights increases.

From bins 1 to 3, the two availability classes seem balanced; from bins 4 to 7, however, there is a clear shift toward Smaller than 280 days availability. The relationship then inverts in bins 8 and 9, where at least 280 days availability becomes dominant, and both classes are balanced again in the 10th bin.

Feature Engineering
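The engineered features are not shown in code, so this is only a sketch under stated assumptions. It assumes the ten bins were built with `pd.qcut` on the rank of `minimum_nights` (the interval bounds quoted in the tree walkthrough look like rank positions rather than nights), that bins are stored as zero-based codes 0 to 9 (so EDA bin k corresponds to code k-1), and that `room_type` is encoded alphabetically. The column name `room_type_code` is an assumption.

```python
import pandas as pd

# Made-up rows covering all four room types.
listings = pd.DataFrame({
    "minimum_nights": [1, 1, 2, 2, 2, 3, 3, 5, 7, 30,
                       1, 2, 4, 6, 10, 14, 2, 1, 3, 90],
    "room_type": ["Entire home/apt", "Private room", "Hotel room", "Shared room"] * 5,
})

# Ranking first breaks ties, so qcut never sees duplicate bin edges.
ranks = listings["minimum_nights"].rank(method="first")

# labels=False returns the ordinal bin index (0 = lowest decile, 9 = highest).
listings["minimum_nights_Percentile_num"] = pd.qcut(ranks, q=10, labels=False)

# Alphabetical codes: Entire home/apt=0, Hotel room=1, Private room=2, Shared room=3.
listings["room_type_code"] = listings["room_type"].astype("category").cat.codes

print(listings.head())
```

With this encoding, a split such as `room_type <= 1.5` selects exactly Entire home/apt and Hotel room, which matches the interpretation given in the tree description.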

Model and Hyper-parameter Tuning

The binary tree structure has 15 nodes and a depth of 3:

The first split uses the feature minimum_nights_Percentile_num, which corresponds to the minimum number of nights required at a given listing, binned into 10 equally distributed intervals that were then encoded in hierarchical (ordinal) order.

This splits the minimum nights roughly in half, (0.999 to 15265.6) and (15265.6 to 25442.0). As we have seen above in the EDA, around bin number 5 there is a significant increase in target (0) instances relative to target (1).

Two child nodes are then created. Node-1 splits on year: as we have seen, the more recent the last_review, the more likely the listing is to be available for at least 280 days. Node-2 splits on room type, which is balanced across its categories except for private rooms, which seem to be more often available for less than 280 days.

Node-1 then divides again into 2 nodes for the final split, using minimum_nights_Percentile_num <= 2.5, i.e. (0.999 to 7633.3), on the left side and room_type <= 1.5, that is to say room_type in ['Entire home/apt', 'Hotel Room'], on the right.

Node-2, on the right, divides into minimum_nights_Percentile_num <= 7.5, i.e. (15265.6 to 20353.8), on the left and year <= 2018.5 on the right.
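A walkthrough like the one above can be read directly off a fitted scikit-learn tree. The sketch below trains a depth-3 tree on synthetic data that reuses the three feature names, so the splits it prints will not match the ones described here; it only illustrates `export_text`, `tree_.node_count` and `get_depth()`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 400

# Synthetic features with the same names as in the report (not real data).
X = pd.DataFrame({
    "minimum_nights_Percentile_num": rng.integers(0, 10, size=n),
    "year": rng.integers(2012, 2021, size=n),
    "room_type": rng.integers(0, 4, size=n),
})
# Synthetic target loosely tied to two of the features.
y = ((X["year"] >= 2019) ^ (X["minimum_nights_Percentile_num"] > 5)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# One line per node: the split condition for internal nodes, the class for leaves.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
print("nodes:", tree.tree_.node_count, "depth:", tree.get_depth())
```

A full binary tree of depth 3 has 8 leaves and 7 internal nodes, which is where the figure of 15 nodes comes from.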

Summary of the above:

The Tuned Model has an overall balanced performance, minimizing Type I error (False Positives) and achieving a high recall (0.69) for the Smaller than 280 target. Given that this is a balanced dataset, where the target variable is distributed roughly 50/50 between 0 and 1, this high recall for the (0) target is also very important.

Regarding the model, we can say that it is a "pessimistic" model, as it tends toward Type II error (False Negatives).
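To make the error types concrete, here is a small sketch with made-up labels (not the model's actual predictions) showing how false positives, false negatives and per-class recall can be read from scikit-learn's metrics. Class 1 (at least 280 days) is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = smaller than 280 days, 1 = at least 280 days.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields (tn, fp, fn, tp) with class 1 as positive.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall per class: share of each true class that the model recovers.
recall_class_0 = recall_score(y_true, y_pred, pos_label=0)
recall_class_1 = recall_score(y_true, y_pred, pos_label=1)

print(f"Type I errors (FP): {fp}, Type II errors (FN): {fn}")
print(f"recall(0) = {recall_class_0:.2f}, recall(1) = {recall_class_1:.2f}")
```

In this toy example the model is "pessimistic" in the same sense: it makes more false negatives than false positives, so recall is higher for class 0 than for class 1.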

It is also very interesting to notice that, while the hyper-parameter tuning allows for a max_depth of up to 10, the Tuned Model maximizes the f1-score at max_depth=3 using all 3 features.
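The exact search procedure is not shown, so the following is only one plausible way to tune `max_depth` from 1 to 10 while optimizing the f1-score. It assumes a grid search with 5-fold cross-validation and runs on synthetic data with 3 features, so the selected depth here has no bearing on the reported result.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, roughly balanced binary problem with 3 features (not real data).
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},  # depths 1 through 10
    scoring="f1",
    cv=5,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print("best max_depth:", best_depth, "mean CV f1:", round(search.best_score_, 3))
```

Inspecting `search.cv_results_` shows how the f1-score evolves with depth, which is a direct way to check whether deeper trees would help.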

Next steps would be to evaluate changes in performance when allowing for even larger trees, and to explore the impact of other features, namely neighbourhood and month, to check for possible cyclicality.

Reading the Decision Tree